Brownian Bandits

Authors

  • Avi Mandelbaum
  • Robert J. Vanderbei
Abstract

We consider a multi-armed bandit whose arms are driven by Brownian motions: the state of arm i is modeled as a one-dimensional Brownian motion B^i, i = 1, ..., d. At each instant in time, a gambler must decide to pull some subset of these d arms, holding the others fixed at their current states. If arm i is pulled when B^i is in state x_i, the gambler accumulates reward at rate r_i(x_i). The goal is to find a strategy that maximizes the accumulated discounted reward over an infinite horizon (with discount rate λ). Let

    Γ_i(x_i) = sup_{τ>0} E^{x_i}[∫_0^τ e^{−λt} r_i(B^i_t) dt] / E^{x_i}[∫_0^τ e^{−λt} dt],

where the supremum is over all stopping times τ of B^i. Put

    M_i = {x ∈ ℝ^d : Γ_i(x_i) = max_j Γ_j(x_j)}.

From prior work, one expects that an optimal control is to pull arm i when the state of the bandit belongs to M_i. Equivalently, the optimal strategy follows the leader among the index processes Γ_i(B^i), i = 1, ..., d. Such results have been established for bounded monotone reward functions. In this paper we extend their scope to cover certain unimodal reward functions. At the same time, we develop a framework within which general rewards and diffusions can be studied.
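Since the conjectured optimal control is fully specified by the index functions Γ_i, it can be illustrated numerically. The following is a minimal sketch, not from the paper: `gamma_hat` estimates Γ(x) by Monte Carlo, restricting the supremum to downward hitting times τ_a = inf{t : B_t ≤ a} (a natural family when r is nondecreasing), and `follow_the_leader` simulates the index policy under an Euler discretization. All names, constants, and the choice of stopping-time family are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

LAM = 1.0       # discount rate lambda (assumed value)
DT = 2e-3       # Euler time step (assumed)
N_STEPS = 1500  # truncation horizon for the integrals (assumed)
N_PATHS = 400   # Monte Carlo sample size (assumed)


def gamma_hat(r, x, thresholds):
    """Monte Carlo estimate of Gamma(x), restricting the supremum over stopping
    times to tau_a = inf{t : B_t <= a}, a in `thresholds` -- a reasonable family
    when r is nondecreasing (our assumption, not a claim of the paper)."""
    best = -np.inf
    for a in thresholds:
        b = np.full(N_PATHS, float(x))        # current positions of all paths
        alive = np.ones(N_PATHS, dtype=bool)  # paths that have not yet stopped
        num = den = 0.0
        disc = 1.0                            # e^{-lambda t} at the current step
        for _ in range(N_STEPS):
            alive &= b > a                    # stop a path once it reaches a
            m = int(alive.sum())
            if m == 0:
                break
            num += disc * float(np.sum(r(b[alive]))) * DT
            den += disc * m * DT
            b[alive] += np.sqrt(DT) * rng.standard_normal(m)
            disc *= np.exp(-LAM * DT)
        if den > 0.0:
            best = max(best, num / den)
    return best


def follow_the_leader(rewards, gammas, x0, n_steps=N_STEPS):
    """Simulate the conjectured optimal control: at each instant pull the arm
    whose current index Gamma_i(x_i) is largest; only that arm's state moves."""
    x = np.array(x0, dtype=float)
    total, disc = 0.0, 1.0
    for _ in range(n_steps):
        i = int(np.argmax([g(xi) for g, xi in zip(gammas, x)]))
        total += disc * rewards[i](x[i]) * DT
        x[i] += np.sqrt(DT) * rng.standard_normal()
        disc *= np.exp(-LAM * DT)
    return total


# Demo: two identical arms with the bounded increasing reward r(x) = tanh(x).
# For identical monotone arms the index is a common increasing function of the
# state, so following the leader in Gamma is the same as following the leader
# in the state itself; we therefore pass r as a monotone stand-in for Gamma.
r = np.tanh
print("Gamma_hat(0) ~", gamma_hat(r, 0.0, thresholds=np.linspace(-3.0, -0.2, 8)))
print("discounted reward ~", follow_the_leader([r, r], [r, r], x0=[0.0, 0.5]))
```

The restriction to hitting times is what makes the estimate cheap: for monotone rewards the optimal stopping region is a half-line, so a one-parameter search over thresholds suffices; for the unimodal rewards treated in the paper a richer family of stopping times would be needed.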


Related articles

Asymptotic optimal control of multi-class restless bandits

We study the asymptotic optimal control of multi-class restless bandits. A restless bandit is a controllable process whose state evolution depends on whether or not the bandit is made active. The aim is to find a control that determines at each decision epoch which bandits to make active in order to minimize the overall average cost associated with the states the bandits are in. Sinc...
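The control problem described above, choosing which of N restless bandits to activate at each epoch, can be made concrete with a small simulation. The sketch below is not from the paper: the birth-death dynamics, the costs, and the myopic rule that activates the M bandits in the costliest states are all illustrative assumptions, standing in for the asymptotically optimal policies the paper analyzes.

```python
import numpy as np

rng = np.random.default_rng(2)

N, M, T = 8, 2, 10_000            # bandits, active budget per epoch, horizon (assumed)
S = 5                             # states per bandit (assumed)
cost = np.arange(S, dtype=float)  # cost of occupying state s (assumed: higher is worse)

# Synthetic birth-death dynamics: an active bandit tends to drift down (improve),
# a passive bandit tends to drift up (degrade). Purely illustrative.
def step(s, active):
    p_down = 0.6 if active else 0.2
    if rng.random() < p_down:
        return max(s - 1, 0)
    return min(s + 1, S - 1)

state = rng.integers(0, S, size=N)
total_cost = 0.0
for _ in range(T):
    # Myopic heuristic: activate the M bandits currently in the worst states.
    active = np.argsort(-cost[state])[:M]
    total_cost += cost[state].sum()
    for i in range(N):
        state[i] = step(state[i], i in active)

print("average cost per epoch:", total_cost / T)
```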


Resourceful Contextual Bandits

We study contextual bandits with ancillary constraints on resources, which are common in real-world applications such as choosing ads or dynamic pricing of items. We design the first algorithm for solving these problems that improves over a trivial reduction to the non-contextual case. We consider very general settings for both contextual bandits (arbitrary policy sets, Dudik et al. (2011)) and ...


Semi-Bandits with Knapsacks

We unify two prominent lines of work on multi-armed bandits: bandits with knapsacks and combinatorial semi-bandits. The former concerns limited “resources” consumed by the algorithm, e.g., limited supply in dynamic pricing. The latter allows a huge number of actions but assumes combinatorial structure and additional feedback to make the problem tractable. We define a common generalization, supp...


Matroid Bandits: Practical Large-Scale Combinatorial Bandits

A matroid is a notion of independence that is closely related to computational efficiency in combinatorial optimization. In this work, we bring together the ideas of matroids and multi-armed bandits, and propose a new class of stochastic combinatorial bandits, matroid bandits. A key characteristic of this class is that matroid bandits can be solved both computationally and sample efficiently. We...
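The computational tractability mentioned above comes from the fact that the greedy algorithm finds a maximum-weight basis of a matroid exactly. A natural bandit strategy, sketched below under assumptions of ours (a uniform rank-k matroid, Gaussian noise, standard UCB bonuses), is to run greedy on optimistic weight estimates each round and learn from the semi-bandit feedback; this is in the spirit of the approach such papers describe, not necessarily this paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)

K, k, T = 10, 3, 5000            # items, matroid rank, rounds (all assumed)
mu = rng.uniform(0.1, 0.9, K)    # unknown mean item weights (synthetic)

counts = np.zeros(K)             # pulls per item
means = np.zeros(K)              # empirical mean weight per item

def independent(S):
    """Independence oracle; here a uniform matroid: any set of <= k items."""
    return len(S) <= k

for t in range(1, T + 1):
    # Optimistic weight: empirical mean plus a UCB exploration bonus
    # (unpulled items get +inf so every item is explored at least once).
    bonus = np.sqrt(2 * np.log(t) / np.maximum(counts, 1))
    ucb = np.where(counts > 0, means + bonus, np.inf)
    # Greedy maximum-weight basis with respect to the optimistic weights;
    # greedy is exactly optimal on matroids, which makes the class tractable.
    basis = []
    for i in np.argsort(-ucb):
        if independent(basis + [i]):
            basis.append(i)
    # Semi-bandit feedback: observe a noisy weight for each chosen item.
    for i in basis:
        w = mu[i] + 0.1 * rng.standard_normal()
        counts[i] += 1
        means[i] += (w - means[i]) / counts[i]

print("chosen basis:", sorted(basis), " best items:", sorted(np.argsort(-mu)[:k]))
```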


An algorithm with nearly optimal pseudo-regret for both stochastic and adversarial bandits

We present an algorithm that achieves almost optimal pseudo-regret bounds against adversarial and stochastic bandits. Against adversarial bandits the pseudo-regret is O(K√(n log n)), and against stochastic bandits the pseudo-regret is O(∑_i (log n)/Δ_i). We also show that no algorithm with O(log n) pseudo-regret against stochastic bandits can achieve Õ(√n) expected regret against adaptive...




Publication date: 2013